Abstract

Interesting theoretical associations have been established by recent papers between 
the fields of active learning and stochastic convex optimization due to the common role 
of feedback in sequential querying mechanisms. In this paper, we continue this thread 
in two parts by exploiting these relations for the first time to yield novel algorithms in 
both fields, further motivating the study of their intersection. First, inspired by a recent 
optimization algorithm that was adaptive to unknown uniform convexity parameters, 
we present a new active learning algorithm for one-dimensional thresholds that can yield 
minimax rates by adapting to unknown noise parameters. Next, we show that one can 
perform d-dimensional stochastic minimization of smooth uniformly convex functions 
when only granted oracle access to noisy gradient signs along any coordinate instead of 
real-valued gradients, by using a simple randomized coordinate descent procedure where 
each line search can be solved by 1-dimensional active learning, provably achieving the 
same error convergence rate as having the entire real-valued gradient. Combining these 
two parts yields an algorithm that solves stochastic convex optimization of uniformly 
convex and smooth functions using only noisy gradient signs by repeatedly performing 
active learning, achieves optimal rates and is adaptive to all unknown convexity and 
smoothness parameters. 


1 Introduction 

The two fields of convex optimization and active learning seem to have evolved quite
independently of each other. Recently, [1] pointed out their relatedness due to the inherent
sequential nature of both fields and the complex role of feedback in taking future actions.
Following that, [2] made the connections more explicit by tying together the exponent used 
in noise conditions in active learning and the exponent used in uniform convexity (UC) 
in optimization. They used this to establish lower bounds (and tight upper bounds) in 
stochastic optimization of UC functions based on proof techniques from active learning. 
However, it was unclear if there were concrete algorithmic ideas in common between the 
fields. 

Here, we provide a positive answer by exploiting the aforementioned connections to
form new algorithms that demonstrate that the complexity of d-dimensional stochastic
optimization is precisely the complexity of 1-dimensional active learning. Inspired by an
optimization algorithm that was adaptive to unknown uniform convexity parameters, we
design a one-dimensional active learner that is also adaptive to unknown noise parameters.
This algorithm is simpler than the adaptive active learning algorithm proposed recently
in [3], which handles the pool-based active learning setting.

Given access to this active learner as a subroutine for line search, we show that a simple 
randomized coordinate descent procedure can minimize uniformly convex functions with a 
much simpler stochastic oracle that returns only a Bernoulli random variable representing 
a noisy sign of the gradient in a single coordinate direction, rather than a full-dimensional 
real-valued gradient vector. The resulting algorithm is adaptive to all unknown UC and
smoothness parameters and achieves minimax optimal convergence rates.

We spend the first two sections describing the problem setup and preliminary insights, 
before describing our algorithms in sections 3 and 4. 

1.1 Setup of First-Order Stochastic Convex Optimization 

First-order stochastic convex optimization is the task of approximately minimizing a convex 
function over a convex set, given oracle access to unbiased estimates of the function and 
gradient at any point, using as few queries as possible [4].

We will assume that we are given an arbitrary set $S \subset \mathbb{R}^d$ of known diameter bound
$R = \max_{x,y \in S} \|x - y\|$. A convex function $f$ with $x^* = \arg\min_{x \in S} f(x)$ is said to be
$k$-uniformly convex if, for some $\lambda > 0$, $k \geq 2$, we have for all $x, y \in S$

$$f(y) \geq f(x) + \nabla f(x)^T (y - x) + \frac{\lambda}{2}\|x - y\|^k$$

(strong convexity arises when $k = 2$). $f$ is $L$-Lipschitz for some $L > 0$ if $\|\nabla f(x)\|_* \leq L$
(where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$); equivalently for all $x, y \in S$

$$|f(x) - f(y)| \leq L\|x - y\|$$

A differentiable $f$ is $H$-strongly smooth (or has an $H$-Lipschitz gradient) for some $H > \lambda$ if
for all $x, y \in S$, we have $\|\nabla f(x) - \nabla f(y)\|_* \leq H\|x - y\|$, or equivalently

$$f(y) \leq f(x) + \nabla f(x)^T (y - x) + \frac{H}{2}\|x - y\|^2$$

In this paper we shall always assume $\|\cdot\| = \|\cdot\|_* = \|\cdot\|_2$ and deal with strongly smooth and
uniformly convex functions with parameters $\lambda > 0$, $k \geq 2$, $L, H > 0$.

A stochastic first order oracle is a function that accepts $x \in S$, and returns

$$\big(\hat{f}(x), \hat{g}(x)\big) \in \mathbb{R}^{d+1} \quad \text{where} \quad \mathbb{E}[\hat{f}(x)] = f(x), \;\; \mathbb{E}[\hat{g}(x)] = \nabla f(x)$$

(these unbiased estimates also have bounded variance) and the expectation is over any
internal randomness of the oracle.

An optimization algorithm is a method that sequentially queries an oracle at points in $S$
and returns $x_T$ as an estimate of the optimum of $f$ after $T$ queries (or alternatively tries
to achieve an error of $\epsilon$). Its performance can be measured by either the function error
$f(x_T) - f(x^*)$ or the point error $\|x_T - x^*\|$.
1.2 Stochastic Gradient-Sign Oracles

Define a stochastic sign oracle to be a function of $x \in S$, $j \in \{1 \ldots d\}$, that returns

$$\hat{s}_j(x) \in \{+, -\} \quad \text{where}^1 \quad |\eta(x) - 0.5| = \Theta\big(|[\nabla f(x)]_j|\big) \quad \text{and} \quad \eta(x) = \Pr\big(\hat{s}_j(x) = + \mid x\big)$$

where $\hat{s}_j(x)$ is a noisy $\mathrm{sign}([\nabla f(x)]_j)$ and $[\nabla f(x)]_j$ is the $j$-th coordinate of $\nabla f$, and the
probability is over any internal randomness of the oracle. This behavior of $\eta(x)$ actually
needs to hold only when $|[\nabla f(x)]_j|$ is small.

In this paper, we consider coordinate descent algorithms that are motivated by
applications where computing the overall gradient, or even a function value, can be expensive
due to high dimensionality or huge amounts of data, but computing the gradient in any one
coordinate can be cheap. [5] mentions the example of $\min_x \frac{1}{2}\|Ax - b\|^2 + \frac{1}{2}\|x\|^2$ for some
$n \times d$ matrix $A$ (or any other regularization that decomposes over dimensions). Computing
the gradient $A^T(Ax - b) + x$ is expensive, because of the matrix-vector multiply. However,
its $j$-th coordinate is $A_j^T(Ax - b) + x_j$, where $A_j$ is the $j$-th column of $A$, and requires an
expense of only $n$ if the residual vector $Ax - b$ is kept track of (this is easy to do, since on
a single coordinate update of $x$, the residual change is proportional to $A_j$, an additional
expense of $n$).
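A minimal sketch of the residual bookkeeping described above, assuming the ridge objective $\frac{1}{2}\|Ax - b\|^2 + \frac{1}{2}\|x\|^2$ with $A$ stored as a list of rows. The helper names `coordinate_gradient` and `coordinate_update` and the toy data are illustrative assumptions, not taken from [5].

```python
def coordinate_gradient(A, residual, x, j):
    """j-th partial derivative of 0.5*||Ax - b||^2 + 0.5*||x||^2, given residual = Ax - b."""
    # A_j^T (Ax - b) + x_j costs O(n) once the residual is available.
    return sum(A[i][j] * residual[i] for i in range(len(A))) + x[j]


def coordinate_update(A, residual, x, j, step):
    """Move x along coordinate j by `step` and refresh the residual in O(n)."""
    x[j] += step
    for i in range(len(A)):
        residual[i] += step * A[i][j]  # residual changes by step * A_j


# Toy data (hypothetical): n = 3 rows, d = 2 columns.
A = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]
b = [1.0, 0.0, 2.0]
x = [0.5, -1.0]
residual = [sum(A[i][c] * x[c] for c in range(2)) - b[i] for i in range(3)]

g0 = coordinate_gradient(A, residual, x, 0)
coordinate_update(A, residual, x, 0, -0.1)
```

Neither helper ever forms the full product $A^T(Ax - b)$; each call touches only one column of $A$.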

A sign oracle is weaker than a first order oracle, and can actually be obtained by
returning the sign of the first order oracle's noisy gradient if the mass of the noise distribution
grows linearly around its zero mean (argued in the next section). At the optimum along
coordinate $j$, the oracle returns $\pm 1$ with equal probability, and otherwise returns the correct
sign with a probability proportional to the value of the directional derivative at that point
(this reflects the fact that the larger the derivative's absolute value, the easier it
would be for the oracle to approximate its sign, hence the smaller the probability of error).
It is not unreasonable that there may be other circumstances where even calculating the
(real-valued) gradient in the $j$-th direction could be expensive, but estimating its sign could
be a much easier task as it only requires estimating whether function values are expected
to increase or decrease along a coordinate (in a similar spirit to function comparison oracles
[6], but with slightly more power).
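As a hedged illustration of how a sign oracle can be derived from a noisy first order oracle, the sketch below returns the sign of one gradient coordinate after adding noise. The zero-mean Gaussian noise, the name `make_sign_oracle`, and the toy objective are assumptions made for this example only.

```python
import random


def make_sign_oracle(grad, noise_std=1.0, seed=0):
    """Return oracle(x, j) -> +1 or -1, a noisy sign of the j-th partial derivative."""
    rng = random.Random(seed)

    def oracle(x, j):
        # Sign of a noisy gradient coordinate; Gaussian noise is an assumption here.
        noisy = grad(x)[j] + rng.gauss(0.0, noise_std)
        return 1 if noisy > 0 else -1

    return oracle


# Toy objective f(x) = 0.5 * ||x||^2, whose gradient is x itself.
oracle = make_sign_oracle(lambda x: list(x), noise_std=1.0)
point = [2.0, 0.0]
frac_plus_0 = sum(oracle(point, 0) == 1 for _ in range(4000)) / 4000
frac_plus_1 = sum(oracle(point, 1) == 1 for _ in range(4000)) / 4000
```

Along coordinate 0 the derivative is large, so the oracle is almost always correct; along coordinate 1 the derivative is zero, so the two signs are roughly equally likely.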

We will also see that the rates for optimization crucially depend on whether the gradient 
noise is sign-preserving or not. For instance, with rounding errors or storing floats with small 
precision, one can get deterministic rates as if we had the exact gradient since the rounding 
or lower precision doesn’t flip signs. 

1.3 Setup of Active Threshold Learning 

The problem of one-dimensional threshold estimation assumes you have an interval of length
$R$, say $[0, R]$. Given a point $x$, it has a label $y \in \{+, -\}$ that is drawn from an unknown
conditional distribution $\eta(x) = \Pr(Y = + \mid X = x)$ and the threshold $t$ is the unique point
where $\eta(x) = 1/2$, with it being larger than half on one side of $t$ and smaller than half on
the other (hence it is more likely to draw a $+$ on one side of $t$ and a $-$ on the other side).

The task of active learning of threshold classifiers allows the learner to sequentially
query $T$ (possibly dependent) points, observing labels drawn from the unknown conditional
distribution after each query, with the goal of returning a guess $x_T$ as close to $t$ as possible.
In the formal study of classification (cf. [7]), it is common to study minimax rates when
the regression function $\eta(x)$ satisfies Tsybakov's noise or margin condition (TNC) with
exponent $k$ at the threshold $t$. Different versions of this boundary noise condition are used
in regression, density or level-set estimation and lead to an improvement in minimax optimal
rates (for classification, also cf. [8], [3]). Here, we present the version of TNC used in [9]:

$$M|x - t|^{k-1} \geq |\eta(x) - 1/2| \geq \mu|x - t|^{k-1} \quad \text{whenever}^2 \quad |\eta(x) - 1/2| \leq \epsilon_0$$

for some constants $M > \mu > 0$, $\epsilon_0 > 0$, $k \geq 1$.

1 $f = \Theta(g)$ means $f = \Omega(g)$ and $f = O(g)$ (rate of growth)

A standard measure for how well a classifier $h$ performs is given by its risk, which
is simply the probability of classification error (expectation under 0-1 loss), $\mathcal{R}(h) = \Pr[h(x) \neq y]$.
The performance of threshold learning strategies can be measured by the
excess classification risk of the resultant threshold classifier at $x_T$ compared to the Bayes
optimal classifier at $t$ as given by$^3$

$$\mathcal{R}(x_T) - \mathcal{R}(t) = \int_{x_T \wedge t}^{x_T \vee t} |2\eta(x) - 1| \, dx \qquad (1)$$

In the above expression, akin to [9], we use a uniform marginal distribution for active
learning since there is no underlying distribution over $x$. Alternatively, one can simply
measure the one-dimensional point error $|x_T - t|$ in estimation of the threshold. Minimax
rates for estimation of risk and point error in active learning under TNC were provided in
[9] and are summarized in the next section.
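To make the excess-risk integral concrete, the sketch below evaluates it numerically for a hypothetical regression function that meets the TNC with equality, $\eta(x) = 1/2 + \mu\,\mathrm{sign}(x - t)|x - t|^{k-1}$. In that case the integral has the closed form $2\mu|x_T - t|^k / k$. The parameter values and the name `excess_risk` are assumptions for illustration.

```python
def excess_risk(eta, x_hat, t, steps=10000):
    """Midpoint-rule approximation of the integral of |2*eta(x) - 1| between x_hat and t."""
    lo, hi = min(x_hat, t), max(x_hat, t)
    h = (hi - lo) / steps
    return h * sum(abs(2.0 * eta(lo + (i + 0.5) * h) - 1.0) for i in range(steps))


# Hypothetical regression function meeting the TNC with equality: mu = 0.5, k = 2, t = 0.3.
mu, k, t = 0.5, 2, 0.3


def eta(x):
    return 0.5 + mu * (1 if x >= t else -1) * abs(x - t) ** (k - 1)


numeric = excess_risk(eta, x_hat=0.5, t=t)
closed_form = 2.0 * mu * abs(0.5 - t) ** k / k
```

The closed form shows why a point error of order $|x_T - t|$ translates into an excess risk of order $|x_T - t|^k$.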

1.4 Summary of Contributions 

Now that we have introduced the notation used in our paper and some relevant previous 
work (more in the next section), we can clearly state our contributions. 

• We generalize an idea from [10] to present a simple epoch-based active learning
algorithm with a passive learning subroutine that can optimally learn one-dimensional
thresholds and is adaptive to unknown noise parameters.

• We show that noisy gradient signs suffice for minimization of uniformly convex
functions by proving that a random coordinate descent algorithm with an active learning
line-search subroutine achieves minimax convergence rates.

• Due to the connection between the relevant exponents in the two fields, we can combine 
the above two methods to get an algorithm that achieves minimax optimal rates and 
is adaptive to unknown convexity parameters. 

• As a corollary, we argue that with access to possibly noisy non-exact gradients that 
don’t switch any signs (rounding errors or low-precision storage are sign-preserving), 
we can still achieve exponentially fast deterministic rates. 

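2 Note that $|x - t| \leq \delta_0 := \left(\frac{\epsilon_0}{M}\right)^{\frac{1}{k-1}} \implies |\eta(x) - 1/2| \leq \epsilon_0 \implies |x - t| \leq \left(\frac{\epsilon_0}{\mu}\right)^{\frac{1}{k-1}}$
3 $a \vee b := \max(a, b)$ and $a \wedge b := \min(a, b)$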
2 Preliminary Insights

2.1 Connections Between Exponents 

Taking one point as $x^*$ in the definition of UC, we see that

$$|f(x) - f(x^*)| \geq \frac{\lambda}{2}\|x - x^*\|^k$$

Since $\|\nabla f(x)\| \|x - x^*\| \geq \nabla f(x)^T (x - x^*) \geq f(x) - f(x^*)$ (by convexity),

$$\|\nabla f(x) - 0\| \geq \frac{\lambda}{2}\|x - x^*\|^{k-1}$$

Another relevant fact for us will be that uniformly convex functions in $d$ dimensions are
uniformly convex along any one direction, or in other words, for every fixed $x \in S$ and fixed
unit vector $u \in \mathbb{R}^d$, the univariate function of $\alpha$ defined by $f_{x,u}(\alpha) := f(x + \alpha u)$ is also UC
with the same parameters$^4$. For $u = e_j$,

$$|[\nabla f(x)]_j - 0| \geq \frac{\lambda}{2}\|x - x_j^*\|^{k-1}$$

where $x_j^* = x + \alpha_j^* e_j$ and $\alpha_j^* = \arg\min_{\{\alpha \mid x + \alpha e_j \in S\}} f(x + \alpha e_j)$. This uncanny similarity to the
TNC (since $\nabla f(x^*) = 0$) was mathematically exploited in [2] where the authors used a lower
bounding proof technique for one-dimensional active threshold learning from [9] to provide
a new lower bounding proof technique for the d-dimensional stochastic convex optimization
of UC functions. In particular, they showed that the minimax rate for 1-dimensional active
learning excess risk and the d-dimensional optimization function error both scaled like$^5$
$\tilde{\Theta}(T^{-\frac{k}{2k-2}})$, and that the point error in both settings scaled like $\tilde{\Theta}(T^{-\frac{1}{2k-2}})$, where $k$ is
either the TNC exponent or the UC exponent, depending on the setting. This connection
is central to the rest of the paper.
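As a quick worked illustration of these exponents (plain arithmetic on the two rates quoted above, ignoring constants and polylogarithmic factors), the first column below is the function error or excess risk and the second is the point error:

```latex
\begin{align*}
k = 2: &\quad T^{-\frac{k}{2k-2}} = T^{-1}, & T^{-\frac{1}{2k-2}} &= T^{-\frac{1}{2}} \\
k = 3: &\quad T^{-\frac{k}{2k-2}} = T^{-\frac{3}{4}}, & T^{-\frac{1}{2k-2}} &= T^{-\frac{1}{4}} \\
k \to \infty: &\quad T^{-\frac{k}{2k-2}} \to T^{-\frac{1}{2}}, & T^{-\frac{1}{2k-2}} &\to T^{0}
\end{align*}
```

So the strongly convex case $k = 2$ gives the fastest rates, and both rates degrade as the function becomes flatter around its minimum.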

As mentioned earlier, [9] requires a two-sided TNC condition (upper and lower growth
conditions, to pin down the exact rate of growth) in order to prove risk upper bounds. On a
similar note, for uniformly convex functions, we will assume such a Local $k$-Strong
Smoothness condition around directional minima

Assumption LkSS: for all $j \in \{1 \ldots d\}$, $\quad |[\nabla f(x)]_j - 0| \leq \Lambda \|x - x_j^*\|^{k-1}$

for some constant $\Lambda > \lambda/2$, so we can tightly characterize the rate of growth as

$$|[\nabla f(x)]_j - 0| = \Theta\big(\|x - x_j^*\|^{k-1}\big)$$

This condition is implied by strong smoothness or Lipschitz smooth gradients when $k = 2$
(for strongly convex and strongly smooth functions), but is a slightly stronger assumption
otherwise.

4 Since $f$ is UC, $f_{x,u}(\alpha) \geq f_{x,u}(0) + \alpha \nabla f_{x,u}(0) + \frac{\lambda}{2}|\alpha|^k$
5 we use $\tilde{O}, \tilde{\Theta}$ to hide constants and polylogarithmic factors
2.2 The One-Dimensional Argument

The basic argument for relating optimization to active learning was made in [2] in the
context of stochastic first order oracles when the noise distribution $P(z)$ is unbiased and
grows linearly around its zero mean, i.e.

$$\int_0^\infty dP(z) = \frac{1}{2} \quad \text{and} \quad \int_0^t dP(z) = \Theta(t)$$

for all $0 < t < t_0$, for constants $t_0$ (similarly for $-t_0 < t < 0$). This is satisfied for Gaussian,
uniform and many other distributions. We reproduce the argument for clarity and then
sketch it for stochastic sign oracles as well.

For any $x \in S$, it is clear that $f_{x,j}(\alpha) := f(x + \alpha e_j)$ is convex; its gradient
$\nabla f_{x,j}(\alpha) := [\nabla f(x + \alpha e_j)]_j$ is an increasing function of $\alpha$ that switches signs at
$\alpha_j^* := \arg\min_{\{\alpha \mid x + \alpha e_j \in S\}} f_{x,j}(\alpha)$, or equivalently at the directional minimum $x_j^* := x + \alpha_j^* e_j$.
One can think of $\mathrm{sign}([\nabla f(x)]_j)$ as being the true label of $x$, $\mathrm{sign}([\nabla f(x)]_j + z)$ as
being the observed label, and finding $x_j^*$ as learning the decision boundary (the point where
labels switch signs). Define the regression function

$$\eta(x) := \Pr\big(\mathrm{sign}([\nabla f(x)]_j + z) = + \mid x\big)$$

and note that minimizing $f_{x,j}$ corresponds to identifying the Bayes threshold classifier as $x_j^*$
because the point at which $\eta(x) = 0.5$ or $[\nabla f(x)]_j = 0$ is $x_j^*$. Consider a point $x = x_j^* + t e_j$
for $t > 0$ with $[\nabla f(x)]_j > 0$, which hence has true label $+$ (a similar argument can be made
for $t < 0$). As discussed earlier, $|[\nabla f(x)]_j| = \Theta\big(\|x - x_j^*\|^{k-1}\big) = \Theta(t^{k-1})$. The probability
of seeing label $+$ is the probability that we draw $z$ in $(-[\nabla f(x)]_j, \infty)$ so that the sign of
$[\nabla f(x)]_j + z$ is still positive. Hence, the regression function can be written as

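$$\eta(x) = \Pr\big([\nabla f(x)]_j + z > 0\big)$$

$$= \Pr(z > 0) + \Pr\big(-[\nabla f(x)]_j < z < 0\big) = 0.5 + \Theta\big([\nabla f(x)]_j\big)$$

$$\implies \big|\eta(x) - \tfrac{1}{2}\big| = \Theta\big([\nabla f(x)]_j\big) = \Theta(t^{k-1}) = \Theta\big(|x - x_j^*|^{k-1}\big)$$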

Hence, $\eta(x)$ satisfies the TNC with exponent $k$, and an active learning algorithm (next
subsection) can be used to obtain a point $x_T$ with small point-error and excess risk. Note that
function error in convex optimization is bounded above by excess risk of the corresponding
active learner using eq (1) because

$$f_j(x_T) - f_j(x_j^*) = \left| \int_{x_T \wedge x_j^*}^{x_T \vee x_j^*} [\nabla f(x)]_j \, dx \right| = \Theta\left( \int_{x_T \wedge x_j^*}^{x_T \vee x_j^*} |2\eta(x) - 1| \, dx \right) = \Theta\big(\mathcal{R}(x_T) - \mathcal{R}(x_j^*)\big)$$

Similarly, for stochastic sign oracles (Sec. 1.2), using $\eta(x) = \Pr(\hat{s}_j(x) = +)$,

$$\big|\eta(x) - \tfrac{1}{2}\big| = \Theta\big(|[\nabla f(x)]_j|\big) = \Theta\big(\|x - x_j^*\|^{k-1}\big)$$
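The linear-growth claim can be checked numerically for one concrete noise model. The sketch below assumes zero-mean Gaussian noise (one of the distributions the text says qualifies) and computes $\eta$ in closed form from the Gaussian CDF; the function name `eta` and the sample gradient values are illustrative.

```python
import math


def eta(grad_j, noise_std=1.0):
    """P(grad_j + z > 0) for z ~ N(0, noise_std^2): the Gaussian CDF at grad_j / noise_std."""
    return 0.5 * (1.0 + math.erf(grad_j / (noise_std * math.sqrt(2.0))))


# For small positive gradients, (eta - 0.5) / gradient stays close to the
# noise density at zero, 1 / sqrt(2 * pi), which is about 0.399.
ratios = [(eta(g) - 0.5) / g for g in (0.01, 0.1, 0.5)]
```

Because the ratio is bounded above and below by constants, $|\eta(x) - 1/2|$ inherits the $\Theta(\|x - x_j^*\|^{k-1})$ growth of the gradient coordinate.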
2.3 A Non-adaptive Active Threshold Learning Algorithm

One can use a grid-based probabilistic variant of binary search called the BZ algorithm
to approximately learn the threshold efficiently in the active setting, provided that $\eta(x)$
satisfies the TNC for known $k, \mu, M$ (it is not adaptive to the parameters of the problem:
one needs to know these constants beforehand). The analysis of BZ and the proof of the
following lemma are discussed in detail in Theorem 1 of [11], Theorem 2 of [9] and the
Appendix of [2].

Lemma 1. Given a 1-dimensional regression function that satisfies the TNC with known
parameters $\mu, k$, then after $T$ queries, the BZ algorithm returns a point $\hat{t}$ such that
$|\hat{t} - t| = \tilde{O}(T^{-\frac{1}{2k-2}})$ and the excess risk is $\tilde{O}(T^{-\frac{k}{2k-2}})$.

Due to the described connection between exponents, one can use BZ to approximately
optimize a one dimensional uniformly convex function $f_j$ with known uniform convexity
parameters $\lambda, k$. Hence, the BZ algorithm can be used to find a point with low function
error by searching for a point with low risk. This, when combined with Lemma 1, yields
the following important result.

Lemma 2. Given a 1-dimensional $k$-UC and LkSS function $f_j$, a line search to find $x_T$
close to $x_j^*$ up to accuracy $|x_T - x_j^*| \leq \eta$ in point-error (here $\eta$ denotes a target accuracy,
not the regression function) can be performed in $\tilde{O}(1/\eta^{2k-2})$ steps using the BZ algorithm.
Alternatively, in $T$ steps we can find $x_T$ such that $f(x_T) - f(x_j^*) = \tilde{O}(T^{-\frac{k}{2k-2}})$.


3 A 1-D Adaptive Active Threshold Learning Algorithm 

We now describe an algorithm for active learning of one-dimensional thresholds that is
adaptive, meaning it can achieve the minimax optimal rate even if the TNC parameters
$M, \mu, k$ are unknown. It is quite different from the non-adaptive BZ algorithm in its flavour,
though it can be regarded as a robust binary search procedure, and its design and proof
are inspired by an optimization procedure from [10] that is adaptive to unknown UC
parameters $\lambda, k$.

Even though [10] considers a specific optimization algorithm (dual averaging), we observe
that their algorithm that adapts to unknown UC parameters can use any optimal convex
optimization algorithm as a subroutine within each epoch. Similarly, our adaptive active
learning algorithm is epoch-based and can use any optimal passive learning subroutine in
each epoch. We note that [3] also developed an adaptive algorithm based on disagreement
coefficient and VC-dimension arguments, but it is in a pool-based setting where one has
access to a large pool of unlabeled data, and is much more complicated.

3.1 An Optimal Passive Learning Subroutine 

The excess risk of passive learning procedures for 1-d thresholds can be bounded by $O(T^{-1/2})$
(e.g. see Alexander's inequality in [12] to avoid $\sqrt{\log T}$ factors from ERM/VC arguments)
and can be achieved by ignoring the TNC parameters.
Consider such a passive learning procedure under a uniform distribution of samples
(mimicked by active learning by querying the domain uniformly) in a ball$^6$ $B(x_0, R)$ around
an arbitrary point $x_0$ of radius $R$ that is known to contain the true threshold $t$. Then
without knowledge of $M, \mu, k$, in $T$ steps we can get a point $x_T$ close to the true threshold
$t$ such that with probability at least $1 - \delta$

$$\mathcal{R}(x_T) - \mathcal{R}(t) = \int_{x_T \wedge t}^{x_T \vee t} |2\eta(x) - 1| \, dx \leq \frac{C_\delta R}{\sqrt{T}}$$

for some constant $C_\delta$. Assuming $x_T$ lies inside the TNC region,

$$\mu \int_{x_T \wedge t}^{x_T \vee t} |x - t|^{k-1} \, dx \leq \int_{x_T \wedge t}^{x_T \vee t} |2\eta(x) - 1| \, dx$$

Hence $\frac{\mu |x_T - t|^k}{k} \leq \frac{C_\delta R}{\sqrt{T}}$. Since $k^{1/k} \leq 2$, w.p. at least $1 - \delta$ we get a point-error

$$|x_T - t| \leq 2 \left[ \frac{C_\delta R}{\mu \sqrt{T}} \right]^{1/k} \qquad (2)$$


We assume that $x_T$ lies within the TNC region: since the interval $|\eta(x) - \frac{1}{2}| \leq \epsilon_0$ has
at least constant width $|x - t| \leq \delta_0 = (\epsilon_0/M)^{\frac{1}{k-1}}$, it will only take a constant number
of iterations to find a point within it. A formal way to argue this would be to see that if
the overall risk goes to zero like $\frac{C_\delta R}{\sqrt{T}}$, then the point cannot stay outside this constant sized
region of width $\delta_0$ where $|\eta(x) - 1/2| \leq \epsilon_0$, since it would accumulate a large constant risk
of at least $\int_t^{t + \delta_0} \mu|x - t|^{k-1} \, dx = \frac{\mu \delta_0^k}{k}$. So as long as $T$ is larger than a constant $T_0 := \frac{C_\delta^2 R^2 k^2}{\mu^2 \delta_0^{2k}}$,
our bound in eq (2) holds with high probability (we can even assume we waste a constant
number of queries to just get into the TNC region before using this algorithm).


3.2 Adaptive One-Dimensional Active Threshold Learner 

Algorithm 1 is a generalized epoch-based binary search, and we repeatedly perform
passive learning in a halving search radius. Let the number of epochs be $E := \log \sqrt{\frac{2T}{C_{\tilde\delta}^2 \log T}} \leq \frac{\log T}{2}$
(if$^7$ constant $C_{\tilde\delta} > 2$) and $\tilde\delta := 2\delta/\log T \leq \delta/E$. Let the time budget per epoch be
$N := T/E$ (the same for every epoch) and the search radius in epoch $e \in \{1, \ldots, E\}$ shrink
as $R_e := 2^{-e+1} R$.

Let us define the minimizer of the risk within the ball of radius $R_e$ centered around $x_{e-1}$
at epoch $e$ as

$$x_e^* = \arg\min \{\mathcal{R}(x) : x \in B(x_{e-1}, R_e)\}$$

Note that $x_e^* = t$ iff $t \in B(x_{e-1}, R_e)$ and will be one end of the interval otherwise.


6 Define $B(x, R) := [x - R, x + R]$

Input: Domain $S$ of diameter $R$, oracle budget $T$, confidence $\delta$
Black Box: Any optimal passive learning procedure $P(x, R, N)$ that outputs an
estimated threshold in $B(x, R)$ using $N$ queries
Choose any $x_0 \in S$, $R_1 = R$, $E = \log \sqrt{\frac{2T}{C_{\tilde\delta}^2 \log T}}$, $N = \frac{T}{E}$
1: while $1 \leq e \leq E$ do
2:     $x_e \leftarrow P(x_{e-1}, R_e, N)$
3:     $R_{e+1} \leftarrow \frac{R_e}{2}$, $e \leftarrow e + 1$
4: end while
Output: $x_E$

Algorithm 1: Adaptive Threshold Learner 
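The epoch structure of Algorithm 1 can be sketched in a few lines. This is an illustration under stated assumptions rather than the analysed procedure: the passive subroutine is a plain empirical risk minimizer over thresholds (`passive_erm`, a stand-in for "any optimal passive learner"), logarithms are taken base 2, and the unknown constant $C_{\tilde\delta}$ is replaced by a hypothetical value `c_delta = 2.0`.

```python
import math
import random


def passive_erm(label, center, radius, n, rng):
    """Empirical risk minimizer for thresholds from n uniform queries in [center - radius, center + radius]."""
    lo, hi = center - radius, center + radius
    xs = sorted(rng.uniform(lo, hi) for _ in range(n))
    ys = [label(x) for x in xs]              # labels in {+1, -1}
    errors = sum(1 for y in ys if y < 0)     # threshold left of all points: predict + everywhere
    best_err, best_i = errors, 0
    for i, y in enumerate(ys):
        errors += 1 if y > 0 else -1         # moving the threshold past xs[i] predicts - there
        if errors < best_err:
            best_err, best_i = errors, i + 1
    left = xs[best_i - 1] if best_i > 0 else lo
    right = xs[best_i] if best_i < n else hi
    return 0.5 * (left + right)


def adaptive_threshold_learner(label, x0, R, T, c_delta=2.0, rng=None):
    """Run E epochs of passive learning with a halving search radius (requires T >= 2)."""
    rng = rng or random.Random(0)
    E = max(1, int(math.log2(math.sqrt(2.0 * T / (c_delta ** 2 * math.log2(T))))))
    N = T // E                               # equal query budget per epoch
    x, radius = x0, R
    for _ in range(E):
        x = passive_erm(label, x, radius, N, rng)
        radius /= 2.0
    return x


# Toy problem (hypothetical): threshold t = 0.3, labels are + with probability eta(x).
rng = random.Random(1)
t = 0.3


def label(x):
    p = 0.5 + max(-0.4, min(0.4, 2.0 * (x - t)))
    return 1 if rng.random() < p else -1


x_hat = adaptive_threshold_learner(label, x0=0.5, R=1.0, T=20000, rng=rng)
```

Note that the learner never receives $\mu$, $M$ or $k$; only the budget $T$ and the initial radius enter the schedule.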


Theorem 1. In the setting of one-dimensional active learning of thresholds, Algorithm 1
adaptively achieves $\mathcal{R}(x_E) - \mathcal{R}(t) = \tilde{O}(T^{-\frac{k}{2k-2}})$ with probability at least $1 - \delta$ in $T$ queries
when the unknown regression function $\eta(x)$ has unknown TNC parameters $\mu, k$.

Proof. Since we use an optimal passive learning subroutine at every epoch, we know that
after each epoch $e$ we have with probability at least $1 - \tilde\delta$

$$\mathcal{R}(x_e) - \mathcal{R}(x_e^*) \leq \frac{C_{\tilde\delta} R_e}{\sqrt{T/E}} \leq C_{\tilde\delta} R_e \sqrt{\frac{\log T}{2T}} \qquad (3)$$

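Since $\eta(x)$ satisfies the TNC (and is bounded above by 1), we have for all $x$

$$\mu|x - t|^{k-1} \leq |\eta(x) - 1/2| \leq 1$$

If the set has diameter $R$, one of the endpoints must be at least $R/2$ away from $t$, and hence
we get a limitation on the maximum value of $\mu$ as $\mu \leq \frac{1}{(R/2)^{k-1}}$. Since $k \geq 2$ and $E \geq 2$,
and $2^{-E} = C_{\tilde\delta}\sqrt{\frac{\log T}{2T}}$, using simple algebra we get

$$\mu \leq \frac{2^{k-1}}{R^{k-1}} \leq \frac{2^{(k-2)E+2} \, 2^{k-1}}{R^{k-1}} = \frac{4 \cdot 2^{-E} \, 2^{(k-1)E} \, 2^{k-1}}{R^{k-1}} = \frac{4 \cdot 2^{-E} \, 2^{k-1}}{(2^{-E}R)^{k-1}} = \frac{4 C_{\tilde\delta} 2^{k-1}}{R_{E+1}^{k-1}} \sqrt{\frac{\log T}{2T}}$$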


We prove that we will be appropriately close to $t$ after some epoch $e^*$ by doing case analysis
on $\mu$. When the true unknown $\mu$ is sufficiently small, i.e.

$$\mu \leq \frac{4 C_{\tilde\delta} 2^{k-1}}{R_2^{k-1}} \sqrt{\frac{\log T}{2T}} \qquad (4)$$

then we show that we'll be done after $e^* = 1$. Otherwise, we will be done after epoch
$2 \leq e^* \leq E$ if the true $\mu$ lies in the range

$$\frac{4 C_{\tilde\delta} 2^{k-1}}{R_{e^*}^{k-1}} \sqrt{\frac{\log T}{2T}} \leq \mu \leq \frac{4 C_{\tilde\delta} 2^{k-1}}{R_{e^*+1}^{k-1}} \sqrt{\frac{\log T}{2T}} \qquad (5)$$



7 By VC theory for threshold classifiers or similar arguments in [13], $C_{\tilde\delta}^2 \sim \log(1/\tilde\delta) \sim \log\log T$ since
$\tilde\delta \sim \delta/\log T$. We treat it as constant for clarity of exposition, but actually lose $\log\log T$ factors like the high
probability arguments in [10] and [2].
To see why we'll be done, equations (4) and (5) imply $R_{e^*+1} \leq 2 \left( \frac{8 C_{\tilde\delta}^2 \log T}{\mu^2 T} \right)^{\frac{1}{2k-2}}$ after epoch
$e^*$ and plugging this into equation (3) with $R_{e^*} = 2R_{e^*+1}$, we get

$$\mathcal{R}(x_{e^*}) - \mathcal{R}(x_{e^*}^*) \leq C_{\tilde\delta} R_{e^*} \sqrt{\frac{\log T}{2T}} = O\left( \left( \frac{\log T}{T} \right)^{\frac{k}{2k-2}} \right) \qquad (6)$$

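For readers checking the exponent in eq (6), the arithmetic can be spelled out as follows. Only the bound $R_{e^*+1} \leq 2 \big( 8 C_{\tilde\delta}^2 \log T / (\mu^2 T) \big)^{\frac{1}{2k-2}}$ and the relation $R_{e^*} = 2R_{e^*+1}$ are used, and constants depending on $\mu$, $k$ and $C_{\tilde\delta}$ are absorbed into the $O(\cdot)$:

```latex
\begin{align*}
C_{\tilde\delta} R_{e^*} \sqrt{\frac{\log T}{2T}}
  &\leq 4 C_{\tilde\delta} \left( \frac{8 C_{\tilde\delta}^2 \log T}{\mu^2 T} \right)^{\frac{1}{2k-2}} \sqrt{\frac{\log T}{2T}}
   = O\left( \left( \frac{\log T}{T} \right)^{\frac{1}{2k-2} + \frac{1}{2}} \right), \\
\frac{1}{2k-2} + \frac{1}{2} &= \frac{1 + (k - 1)}{2k-2} = \frac{k}{2k-2}.
\end{align*}
```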


There are two issues hindering the completion of our proof. The first is that even though
$x_1^* = t$ to start off with, it might be the case that $x_{e^*}^*$ is far away from $t$ since we are chopping
the radius by half at every epoch. Interestingly, in Lemma 3 we will prove that round $e^*$ is
the last round up to which $x_e^* = t$. This would imply from eq (6) that

$$\mathcal{R}(x_{e^*}) - \mathcal{R}(t) = \tilde{O}(T^{-\frac{k}{2k-2}}) \qquad (7)$$

Secondly, we might be concerned that after the round $e^*$, we may move further away from
$t$ in later epochs. However, we will show that since the radii are decreasing geometrically
by half at every epoch, we cannot really wander too far away from $x_{e^*}$. This will give us a
bound (see Lemma 4) like

$$\mathcal{R}(x_E) - \mathcal{R}(x_{e^*}) = \tilde{O}(T^{-\frac{k}{2k-2}}) \qquad (8)$$

We will essentially prove that the final point $x_{e^*}$ of epoch $e^*$ is sufficiently close to the true
optimum $t$, and the final point of the algorithm $x_E$ is sufficiently close to $x_{e^*}$. Summing eq
(7) and eq (8) yields our desired result.

Lemma 3. For all $e \leq e^*$, conditioned on having $x_{e-1}^* = t$, with probability $1 - \tilde\delta$ we have
$x_e^* = t$. In other words, up to epoch $e^*$, the optimal classifier in the domain of each epoch
is the true threshold with high probability.


Proof. $x_e^* = t$ will hold in epoch $e$ if the starting point $x_{e-1}$ of epoch $e$ is such that
the ball of radius $R_e$ around it actually contains $t$, or mathematically if
$|x_{e-1} - t| \leq R_e$. This is trivially satisfied for $e = 1$, and assuming that it is true for epoch
$e - 1$ we will show by induction that it holds true for epoch $e \leq e^*$ w.p. $1 - \tilde\delta$. Notice
that using equation (2), conditioned on the induction going through in previous rounds ($t$
being within the search radius), after the completion of round $e - 1$ we have with probability
$1 - \tilde\delta$

$$|x_{e-1} - t| \leq 2 \left[ \frac{C_{\tilde\delta} R_{e-1}}{\mu \sqrt{T/E}} \right]^{1/k}$$

If this was upper bounded by $R_e$, then the induction would go through. So what we would
really like to show is that $2 \left[ \frac{C_{\tilde\delta} R_{e-1}}{\mu \sqrt{T/E}} \right]^{1/k} \leq R_e$. Since $R_{e-1} = 2R_e$, we effectively want to show
$\frac{2^k C_{\tilde\delta} 2 R_e}{\mu \sqrt{T/E}} \leq R_e^k$, or equivalently that for all $e \leq e^*$ we would like to have

$$\frac{4 C_{\tilde\delta} 2^{k-1}}{R_e^{k-1}} \sqrt{\frac{E}{T}} \leq \mu$$

Since $E \leq \frac{\log T}{2}$, we would be achieving something stronger if we showed

$$\frac{4 C_{\tilde\delta} 2^{k-1}}{R_e^{k-1}} \sqrt{\frac{\log T}{2T}} \leq \mu$$

which is known to be true for every epoch up to $e^*$ by equation (5).


□ 


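Lemma 4. For all $e^* < e \leq E$, $\mathcal{R}(x_e) - \mathcal{R}(x_{e^*}) \leq \frac{C_{\tilde\delta} R_{e^*}}{\sqrt{T/E}} = \tilde{O}(T^{-\frac{k}{2k-2}})$ w.p. $1 - \delta$, i.e. after
epoch $e^*$, we cannot deviate much from where we ended epoch $e^*$.

Proof. For $e > e^*$, we have with probability at least $1 - \tilde\delta$

$$\mathcal{R}(x_e) - \mathcal{R}(x_{e-1}) \leq \mathcal{R}(x_e) - \mathcal{R}(x_e^*) \leq \frac{C_{\tilde\delta} R_e}{\sqrt{T/E}}$$

and hence even for the final epoch $E$, we have with probability $(1 - \tilde\delta)^{E - e^*}$

$$\mathcal{R}(x_E) - \mathcal{R}(x_{e^*}) = \sum_{e = e^*+1}^{E} [\mathcal{R}(x_e) - \mathcal{R}(x_{e-1})] \leq \sum_{e = e^*+1}^{E} \frac{C_{\tilde\delta} R_e}{\sqrt{T/E}}$$

Since the radii are halving in size, this is upper bounded (like equation (6)) by

$$\frac{C_{\tilde\delta} R_{e^*}}{\sqrt{T/E}} \left[ 1/2 + 1/4 + 1/8 + \cdots \right] \leq \frac{C_{\tilde\delta} R_{e^*}}{\sqrt{T/E}} = \tilde{O}(T^{-\frac{k}{2k-2}})$$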


□ 


These lemmas justify the use of equations (7) and (8), whose sum yields our desired
result. Notice that the overall probability of success is at least $(1 - \tilde\delta)^E \geq 1 - \delta$, hence
concluding the proof of the theorem.

□ 


4 Randomized Stochastic-Sign Coordinate Descent 

We now describe an algorithm that can do stochastic optimization of $k$-UC and LkSS
functions in $d > 1$ dimensions when given access to a stochastic sign oracle and a black-box
1-D active learning algorithm, such as our adaptive scheme from the previous section, as
a subroutine. The procedure is well-known in the literature, but the idea that one only
needs noisy gradient signs to perform minimization optimally, and that one can use active
learning as a line-search procedure, is novel to the best of our knowledge.

The idea is to simply perform random coordinate-wise descent with approximate line
search, where the subroutine for line search is an optimal active threshold learning algorithm
that is used to approach the minimum of the function along the chosen direction. Let the
gradient at epoch $e$ be called $\nabla_{e-1} = \nabla f(x_{e-1})$, the unit vector direction of descent $d_e$ be
a unit coordinate vector $e_j$ with $j$ chosen randomly from $\{1 \ldots d\}$, and our step size from $x_{e-1}$ be $\alpha_e$
(determined by active learning) so that our next point is $x_e := x_{e-1} + \alpha_e d_e$.

Assume, for analysis, that the optimum of $f_e(\alpha) := f(x_{e-1} + \alpha d_e)$ is

$$\alpha_e^* := \arg\min_{\alpha} f(x_{e-1} + \alpha d_e) \quad \text{and} \quad x_e^* := x_{e-1} + \alpha_e^* d_e$$

where (due to optimality) the derivative is

$$\nabla f_e(\alpha_e^*) = 0 = \nabla f(x_e^*)^T d_e \qquad (9)$$

The line search to find $\alpha_e$ and $x_e$ that approximates the minimum $x_e^*$ can be accomplished
by any optimal active learning algorithm, once we fix the number of time steps
per line search.


4.1 Analysis of Algorithm 2

Input: set $S$ of diameter $R$, query budget $T$
Oracle: stochastic sign oracle $O_f(x, j)$ returning noisy $\mathrm{sign}([\nabla f(x)]_j)$
BlackBox: algorithm $LS(x, d, n)$: line search from $x$, direction $d$, for $n$ steps
Choose any $x_0 \in S$, $E = d(\log T)^2$
1: while $1 \leq e \leq E$ do
2:     Choose a unit coordinate vector $d_e$ from $\{1 \ldots d\}$ uniformly at random
3:     $x_e \leftarrow LS(x_{e-1}, d_e, T/E)$ using $O_f$
4:     $e \leftarrow e + 1$
5: end while
Output: $x_E$

Algorithm 2: Randomized Stochastic-Sign Coordinate Descent 
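The outer loop of Algorithm 2 can be sketched as below. To keep the example short and self-contained, the line search is a stand-in that the paper does not use: plain bisection with majority-vote signs (`line_search`), not the adaptive active learner of Section 3. The toy quadratic objective, the Gaussian sign noise and base-2 logarithms are likewise assumptions for illustration.

```python
import math
import random


def line_search(sign_oracle, x, j, radius, n):
    """Stand-in line search on coordinate j: bisection with majority-vote signs, at most n oracle calls."""
    rounds = max(1, int(math.log2(max(n, 2))))
    votes = max(1, n // rounds)
    lo, hi = x[j] - radius, x[j] + radius
    y = list(x)
    for _ in range(rounds):
        y[j] = 0.5 * (lo + hi)
        total = sum(sign_oracle(y, j) for _ in range(votes))
        if total > 0:       # derivative looks positive, so the minimum lies to the left
            hi = y[j]
        else:               # derivative looks negative (ties land here), minimum lies to the right
            lo = y[j]
    y[j] = 0.5 * (lo + hi)
    return y


def sign_coordinate_descent(sign_oracle, x0, radius, T, rng):
    """Random coordinate descent with E = d (log T)^2 epochs of T // E oracle calls each."""
    d = len(x0)
    E = max(1, min(T, int(d * math.log2(T) ** 2)))
    x = list(x0)
    for _ in range(E):
        j = rng.randrange(d)                 # uniformly random coordinate
        x = line_search(sign_oracle, x, j, radius, T // E)
    return x


# Toy objective (hypothetical): f(x) = 0.5 * ||x - center||^2, observed only through noisy signs.
rng = random.Random(0)
center = [0.2, -0.1]


def sign_oracle(x, j):
    return 1 if (x[j] - center[j]) + rng.gauss(0.0, 0.2) > 0 else -1


x_final = sign_coordinate_descent(sign_oracle, [0.9, -0.8], radius=1.0, T=40000, rng=rng)
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_final, center)))
```

The optimizer only ever sees values in $\{+1, -1\}$; no function value or real-valued gradient is used anywhere.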

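Let the number of epochs be $E = d(\log T)^2$, and let the number of time steps per epoch be
$T/E$. We can do a line search from $x_{e-1}$ to get $x_e$ that approximates $x_e^*$ well in function error
in $T/E = \tilde{O}(T)$ steps using an active learning subroutine; let the resulting function error
be denoted by $\epsilon' = \tilde{O}(T^{-\frac{k}{2k-2}})$:

$$f(x_e) \leq f(x_e^*) + \epsilon'$$

Also, LkSS and UC allow us to infer (for $k^* = \frac{k}{k-1}$, i.e. $1/k + 1/k^* = 1$)

$$f(x_{e-1}) - f(x_e^*) \geq \frac{\lambda}{2}\|x_{e-1} - x_e^*\|^k \geq \frac{\lambda}{2\Lambda^{k^*}} |\nabla_{e-1}^T d_e|^{k^*}$$

Eliminating $f(x_e^*)$ from the above equations, subtracting $f(x^*)$ from both sides, denoting
$\Delta_e := f(x_e) - f(x^*)$ and taking expectations

$$\mathbb{E}[\Delta_e] \leq \mathbb{E}[\Delta_{e-1}] - \frac{\lambda}{2\Lambda^{k^*}} \mathbb{E}\big[|\nabla_{e-1}^T d_e|^{k^*}\big] + \epsilon'$$

Since$^8$ $\mathbb{E}\big[|\nabla_{e-1}^T d_e|^{k^*} \mid d_1, \ldots, d_{e-1}\big] = \frac{1}{d}\|\nabla_{e-1}\|_{k^*}^{k^*} \geq \frac{1}{d}\|\nabla_{e-1}\|^{k^*}$ we get

$$\mathbb{E}[\Delta_e] \leq \mathbb{E}[\Delta_{e-1}] - \frac{\lambda}{2d\Lambda^{k^*}} \mathbb{E}\big[\|\nabla_{e-1}\|^{k^*}\big] + \epsilon'$$

By convexity, Cauchy-Schwarz and UC$^9$, $\|\nabla_{e-1}\|^{k^*} \geq \left(\frac{\lambda}{2}\right)^{\frac{1}{k-1}} \Delta_{e-1}$.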




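8 $k \geq 2 \implies 1 < k^* \leq 2 \implies \|\cdot\|_{k^*} \geq \|\cdot\|_2$

9 $\Delta_{e-1}^k \leq [\nabla_{e-1}^T (x_{e-1} - x^*)]^k \leq \|\nabla_{e-1}\|^k \|x_{e-1} - x^*\|^k \leq \|\nabla_{e-1}\|^k \frac{2}{\lambda} \Delta_{e-1}$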
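Defining$^{10}$ $C := \frac{1}{d} \left( \frac{\lambda}{2\Lambda} \right)^{k^*} < 1$, we get the recurrence

$$\mathbb{E}[\Delta_e] - \frac{\epsilon'}{C} \leq (1 - C) \left( \mathbb{E}[\Delta_{e-1}] - \frac{\epsilon'}{C} \right)$$

Since $E = d(\log T)^2$ and $\Delta_0 \leq L\|x_0 - x^*\| \leq LR$, after the last epoch, we have

$$\mathbb{E}[\Delta_E] - \frac{\epsilon'}{C} \leq (1 - C)^E \left( \Delta_0 - \frac{\epsilon'}{C} \right) \leq \exp\{-Cd(\log T)^2\} \Delta_0 \leq LR \, T^{-Cd \log T}$$

As long as $T > \exp\{(2\Lambda/\lambda)^{k^*}\}$, a constant, we have $Cd \log T \geq 1$ and

$$\mathbb{E}[\Delta_E] = O(\epsilon') + o(T^{-1}) = \tilde{O}(T^{-\frac{k}{2k-2}})$$
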

which is the desired result. Notice that in this section we didn't need to know $\lambda, \Lambda, k$, because
we simply run randomized coordinate descent for $E = d(\log T)^2$ epochs with $T/E$ steps
per subroutine, and the active learning subroutine was also adaptive to the appropriately
calculated TNC parameters. In summary,


Theorem 2. Given access to only noisy gradient sign information from a stochastic sign
oracle, Randomized Stochastic-Sign Coordinate Descent can minimize UC and LkSS
functions at the minimax optimal convergence rate for expected function error of $\tilde{O}(T^{-\frac{k}{2k-2}})$,
adaptive to all unknown convexity and smoothness parameters. As a special case for $k = 2$,
strongly convex and strongly smooth functions can be minimized at a rate of $\tilde{O}(1/T)$.


4.2 Gradient Sign-Preserving Computations 

A practical concern for implementing optimization algorithms is machine precision, the 
number of decimals to which real numbers are stored. Finite space may limit the accuracy 
with which every gradient can be stored, and one may ask how much these inaccuracies may 
affect the final convergence rate - how is the query complexity of optimization affected if the 
true gradients were rounded to one or two decimal points? If the gradients were randomly 
rounded (to remain unbiased), then one might guess that we could easily achieve stochastic 
first-order optimization rates. 

However, our results give a surprising answer to that question, as a similar argument 
reveals that for UC and LkSS functions (with strongly convex and strongly smooth being 
a special case), our algorithm achieves exponential rates. Since rounding errors do not flip 
any sign in the gradient, even if the gradient was rounded or decimal points were dropped as 
much as possible and we were to return only a single bit per coordinate having the true signs, 
then one can still achieve the exponentially fast convergence rate observed in non-stochastic 
settings - our algorithm needs only a logarithmic number of epochs, and in each epoch active 
learning will approach the directional minimum exponentially fast with noiseless gradient 
signs using a perfect binary search. In fact, our algorithm is the natural generalization for 
a higher-dimensional binary search, both in the deterministic and stochastic settings. 

We can summarize this in the following theorem: 

10 Since $1 < k^* \leq 2$ and $\Lambda > \lambda/2$, we have $C < 1$

Theorem 3. Given access to gradient signs in the presence of sign-preserving noise (such as
deterministic or random rounding of gradients, dropping decimal places for lower precision,
etc), Randomized Stochastic-Sign Coordinate Descent can minimize UC and LkSS functions
exponentially fast, with a function error convergence rate of $\tilde{O}(\exp\{-T\})$.
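The exponential rate is easiest to see in one dimension, where exact derivative signs reduce a line search to ordinary bisection. The sketch below is only that one-dimensional building block, on a hypothetical function whose derivative changes sign at 0.3; the name `sign_bisection` is illustrative.

```python
def sign_bisection(grad_sign, lo, hi, steps):
    """Bisection driven only by the (exact) sign of the derivative."""
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if grad_sign(mid) > 0:   # derivative positive: the minimum is to the left
            hi = mid
        else:                    # derivative non-positive: the minimum is to the right
            lo = mid
    return 0.5 * (lo + hi)


# Hypothetical 1-D objective with minimizer 0.3; only the sign of its derivative is revealed.
minimizer = 0.3
x_hat = sign_bisection(lambda x: 1 if x > minimizer else -1, lo=0.0, hi=1.0, steps=20)
error = abs(x_hat - minimizer)
```

After `steps` sign queries the point error is at most the initial interval width times $2^{-(\text{steps}+1)}$, i.e. it shrinks exponentially in the number of queries.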

5 Discussion 

While the assumption of smoothness is natural for strongly convex functions, our assumption
of LkSS might appear strong in general. It is possible to relax this assumption and require
the LkSS exponent to differ from the UC exponent, or to only assume strong smoothness
- this still yields consistency for our algorithm, but the rate achieved is worse. [10] and
[2] both have epoch-based algorithms that achieve the minimax rates under just Lipschitz
assumptions with access to a full-gradient stochastic first order oracle, but it is hard to
prove the same rates for a coordinate descent procedure without smoothness assumptions.

Given a target function accuracy e instead of query budget T, a similar randomized 
coordinate descent procedure to ours achieves the minimax rate with a similar proof, but it 
is non-adaptive since we presently don’t have an adaptive active learning procedure when 
given e. As of now, we know no adaptive UC optimization procedure when given e. 

Recently, [14] analysed stochastic gradient descent with averaging, and showed that for
smooth functions, it is possible for an algorithm to automatically adapt between convexity
and strong convexity; in comparison, we show how to adapt to unknown uniform
convexity (strong convexity being the special case $k = 2$). It may be possible to combine
the ideas from this paper and [14] to get a universally adaptive algorithm from convex to
all degrees of uniform convexity. It would also be interesting to see if these ideas extend to
connections between convex optimization and learning linear threshold functions.

In this paper, we exploit recently discovered theoretical connections by providing explicit 
algorithms that take advantage of them. We show how these could lead to cross-fertilization 
of fields in both directions and hope that this is just the beginning of a flourishing interaction 
where these insights may lead to many new algorithms if we leverage the theoretical relations 
in more innovative ways. 